Abstract

A variety of genome-wide profiling techniques are available to investigate complementary aspects of genome
structure and function. Integrative analysis of heterogeneous data sources can reveal higher-level interactions that
cannot be detected based on individual observations. A standard integration task in cancer studies is to identify
altered genomic regions that induce changes in the expression of the associated genes based on joint analysis of
genome-wide gene expression and copy number profiling measurements. In this review, we highlight common
approaches to genomic data integration and provide a transparent benchmarking procedure to quantitatively
compare method performance in cancer gene prioritization. Algorithms, data sets and benchmarking results are
available at http://intcomp.r-forge.r-project.org.
INTRODUCTION

Genome-wide profiling technologies, in particular 
microarrays and next-generation sequencing, are 
used to characterize disease-associated changes at 
various levels of genome function. Identification of 
the key players — genes, chromosomal regions or 
biological processes — is a fundamental step toward 
mechanistic characterization of the disease and 
revealing molecular targets for potential therapeutic 
intervention. Genomic, transcriptomic, epigenomic 
and proteomic measurements characterize different 
aspects of genome regulation and function that are
particularly relevant for cancer research [1, 2].
Integrative analysis has been used to prioritize disease 
genes or chromosomal regions for experimental test- 
ing, to discover disease subtypes [3, 4] or to predict 
patient survival or other clinical variables [5]. 
Co-occurring genomic observations are increasingly 
available in private and public repositories, such as 
the Cancer Genome Atlas database [6] and the 
Leukemia Gene Atlas [7], promoting wide access 
to data resources. However, the lack of algorithmic 
implementations forms a bottleneck hampering
integrative approaches.

The integration of gene expression (GE) and copy
number (CN) data to identify DNA CN alterations 
that induce changes in the expression levels of the 
associated genes is a common task in cancer studies 
[8]. The detection of chromosomal regions with 
exceptionally high statistical association between 
CN and GE can pinpoint disease genes and potential 
cancer mechanisms [9, 10]. The first high-throughput
analyses were reported about a decade ago [11-13],
evidencing a clear cis-dosage effect of CN alterations
on GE levels [14-16]. Although the downstream
effect of CN alteration on GE is still a focus of
ongoing research [17, 18], a systematic quantitative
comparison of alternative approaches for integrating
GE/CN data has been missing, as recently
highlighted by Huang et al. [8]. Hence, we designed a
quantitative benchmarking procedure to compare
12 publicly available methods for cancer gene
prioritization based on integrative analysis of CN/GE
profiling data on two simulated and three real case
studies. In the following sections, we give a
methodological overview, introduce the analysis pipeline
and discuss the benchmarking results.

QUANTIFYING ASSOCIATIONS 
BETWEEN GE AND CN 

The available implementations for the integrative 
analysis of GE and CN can be roughly divided into
four main categories. In this section, we provide a 
general overview of these approaches with further 
references to individual algorithms. 

Two-step approaches 

A comparison of GE levels between groups of samples
with distinct CN status aims at revealing
CN-induced transcriptional responses. Several
approaches either first assess the alterations
in each data set separately and then compare the
results from both, or assess alterations in GE in genes
or genomic regions previously identified by an
assessment of CN alterations, to model changes in GE
based on the CN signals [16, 19]. This corresponds to
the biological intuition concerning the cis-regulatory
effect of CN alterations. In the first step, samples and
genes are grouped based on estimated CN levels,
estimated probabilities of CN alterations [20] or
quantiles [21]. In the second step, differential GE is
quantified either between such groups or independently
(with respect to a reference sample) using standard
approaches for GE analysis such as the t-test, which
assesses the difference between two sample groups
based on Gaussian assumptions [13]. Nonparametric
[20, 22] and permutation-based alternatives [23, 24,
36] have also been suggested to relax the normality
assumptions of the t-test. Cancer-associated changes
often affect chromosomal regions of varying sizes,
which potentially contain multiple genes. Therefore,
some methods have been designed specifically to
detect large regions affected by CN alteration
rather than to prioritize individual genes [19, 24].
Nevertheless, the regional modeling of GE and
CN data can help to pinpoint individual driver
genes whose expression is most notably affected by
a larger chromosomal alteration.
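
The two-step idea can be illustrated with a minimal sketch for a single gene. It is not taken from any of the reviewed implementations: the function names, the use of Welch's t statistic and the permutation scheme are assumptions made for illustration only. Samples are first split by a given CN call, and the GE difference between the two groups is then scored with a t statistic whose significance is estimated by permuting the CN labels.

```python
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two groups of expression values."""
    se2 = variance(a) / len(a) + variance(b) / len(b)
    return (mean(a) - mean(b)) / se2 ** 0.5


def two_step_pvalue(ge, cn_altered, n_perm=1000, seed=0):
    """Hypothetical two-step score for one gene.

    ge         -- expression values, one per sample
    cn_altered -- booleans, True where the sample carries the CN alteration
    Returns an empirical two-sided permutation P-value.
    """
    # Step 1: group the samples by CN status.
    altered = [x for x, flag in zip(ge, cn_altered) if flag]
    normal = [x for x, flag in zip(ge, cn_altered) if not flag]
    # Step 2: quantify differential expression between the groups.
    observed = abs(welch_t(altered, normal))

    rng = random.Random(seed)
    labels = list(cn_altered)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break the CN/GE link, keep the group sizes
        a = [x for x, flag in zip(ge, labels) if flag]
        b = [x for x, flag in zip(ge, labels) if not flag]
        if abs(welch_t(a, b)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

Each group needs at least two samples with non-identical values, since the statistic divides by the group variances; a production implementation would have to handle such degenerate cases explicitly.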

Regression approaches 

Another class of tools uses regression models, generally
with CN as the predictor and GE as the response
variable, again exploiting the biological intuition
concerning the cis-regulatory effect of CN alterations.
Both linear [12] and nonlinear regression
models [25] have been proposed. Univariate linear
regression models have been designed to model the
associations between individual CN and GE probes
[26], as well as multiple and/or multivariate linear
regression models that combine statistical power
across multiple probes targeting adjacent genes or
chromosomal positions [14, 26-28]. Regression
models are theoretically related to correlation
analysis. For instance, the square of Pearson's correlation
coefficient estimates the proportion of variance in
the response variable that is explained by the
predictor in a univariate linear regression. If the
variables are standardized beforehand, the regression
coefficient of the predictor variable equals Pearson's
correlation coefficient.
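
The relation between univariate regression and correlation can be written out explicitly. The notation is introduced here for illustration and does not come from the cited works: x_i and y_i denote the CN and GE values of sample i for one matched probe pair, s_x and s_y their sample standard deviations, and r Pearson's correlation coefficient.

```latex
\begin{align*}
y_i &= \alpha + \beta x_i + \varepsilon_i, \\
\hat{\beta} &= \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}
             = r \, \frac{s_y}{s_x}, \\
R^2 &= 1 - \frac{\sum_i (y_i - \hat{\alpha} - \hat{\beta} x_i)^2}{\sum_i (y_i - \bar{y})^2}
     = r^2 .
\end{align*}
```

After standardization s_x = s_y = 1, so the least-squares slope reduces to r, while the explained proportion of variance equals r^2 regardless of scaling.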

Correlation-based approaches 

DR-Correlate [21] and a modified version of the
Ortiz-Estevez algorithm [16] use correlation-based
analysis to scan over the genome and detect loci
with exceptionally high associations between CN and
GE. To address potential shortcomings with respect
to a biologically inadequate reflection of CN and GE
abnormalities by ordinary correlation analysis,
Schafer et al. [29] substitute sample means by the
reference medians, and Lipson et al. [30] use
quantile-based analysis to obtain improved
correlation coefficients. Furthermore, canonical correlation
analysis (CCA) has been suggested to identify general
linear associations between CN and GE data through
flexible detection of weighted combinations of
probes, which reveal maximal correlations between
the two data sources. This is expected to more
efficiently distinguish the relevant shared variation of the
GE/CN data from the data set-specific effects [34].
Various modifications for dimensionality reduction
and model regularization have also been proposed
based on principal component analysis [31] and
penalized approaches based on LASSO, elastic net
or other constraints to obtain sparse or regularized
versions of CCA [5, 32-34]. Although regularization
may reduce overfitting and sparsity can simplify
interpretation of the results, setting the appropriate
regularization parameters may be a challenging task.
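
The simplest form of a correlation-based scan can be sketched as follows. This is a generic illustration using ordinary Pearson correlation per gene, not the DR-Correlate, edira or Ortiz-Estevez procedure; the function names and the dictionary-based data layout are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant profile: no linear association can be estimated
    return sxy / sqrt(sxx * syy)


def rank_genes_by_correlation(cn, ge):
    """Rank genes by CN/GE correlation, strongest positive association first.

    cn, ge -- dicts mapping a gene identifier to its values over the
              same, identically ordered samples
    """
    scores = {gene: pearson(cn[gene], ge[gene]) for gene in cn if gene in ge}
    return sorted(scores, key=scores.get, reverse=True)
```

Ranking by the signed coefficient reflects the expectation that amplifications increase and deletions decrease expression; the modified statistics cited above replace the plain coefficient but keep the same scan-and-rank structure.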

Latent variable models 

Latent variable approaches are used to model directly 
the data-generating processes. For instance, the pint/ 
simcca algorithm [34] decomposes GE and CN data 
sets into shared and independent Gaussian compo- 
nents based on regularized probabilistic CCA. A 
comparison of the shared and data set-specific signals 
is used to pinpoint chromosomal regions with ex- 
ceptionally high levels of dependence between the 
GE/CN observations. Related matrix decompos- 
ition models and iterative, dependence-seeking pro- 
jections have been suggested based on generalized 
singular value decomposition [3] and independent 
component analysis [35]. The advantage of latent 
variable models in comparison with the two-step-, 
correlation- or regression-based approaches is that 
they explicitly model both the signal and noise in 
the data, and take into account the uncertainty in 
the model by integrating over the unknown latent 
variables. These properties help to distinguish signal
from noise in a robust manner, but often come at an
increased computational cost.

BENCHMARKING THE 
ALGORITHMS 

Manual literature search in PubMed and Google 
Scholar using combinations of the keywords 'gene 
expression', 'copy number', 'integration' and inspection
of the Bioconductor repository
(http://www.bioconductor.org) were performed to identify
available implementations, yielding 12 algorithms that
were applicable for cancer gene prioritization based
on integrative analysis of GE/CN data (Table 1).
The source code for Ortiz-Estevez [16] was obtained
from the authors. An automated benchmarking
pipeline was created to compare method performance
on two simulated data sets and three real case
studies (http://intcomp.r-forge.r-project.org).

Each method was used to prioritize candidate
cancer genes, followed by a comparison with a
gold standard list of known cancer genes, and
ranking of the methods based on receiver operating
characteristic (ROC) analysis of the prioritized gene
lists and running times. Investigating the true-positive
rate among the top findings complemented the
standard area under curve (ROC/AUC) analysis,
which considers the overall prioritized gene list.
Default parameters for each method were used
where possible. The following exceptions were
made to apply the algorithms to cancer gene
prioritization. In DR-Correlate [21], empirical P-values
from 1000 random gene permutations were used
to rank the genes. The DR-Correlate t-test option
was not applicable on the Ferrari simulations due to
the low number of replicate samples. CNAmet
[24, 36] requires called CN values and provides
separate lists for amplifications and deletions; thus, the
two lists were pooled and ranked based on the
P-values. Moreover, to enable an unbiased AUC
comparison of CNAmet with all other methods
(that prioritize all genes), random ranks were assigned
to genes labeled by CNAmet with no P-value
(nonsignificant genes). With intCNGEan [20], the
weighted Mann-Whitney test with univariate
analysis was used with an effective P-value threshold of
0.1. In pint/simcca [34], segmented CN data were
used only when the resolution of the CN platform
was higher than the resolution of the GE microarray.
In PREDA/SODEGIR, we used 'spline' for
smoothing, 1000 random gene orderings of the
output regions and the median AUC as an unbiased
output for gene prioritization.
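
The AUC scoring of a prioritized gene list can be sketched as below. The code is an illustration rather than the 'intcomp' implementation, and all names are hypothetical. The helper rank_by_pvalue additionally assumes that genes without a P-value are placed, in random order, below all scored genes; the text above only states that such genes received random ranks.

```python
import random


def rank_by_pvalue(pvalues, seed=0):
    """Best-first gene ranking from a dict gene -> P-value (None if missing).

    Assumption: genes without a P-value are appended in random order
    after all genes that have one.
    """
    scored = sorted((g for g, p in pvalues.items() if p is not None),
                    key=lambda g: pvalues[g])
    unscored = sorted(g for g, p in pvalues.items() if p is None)
    random.Random(seed).shuffle(unscored)
    return scored + unscored


def roc_auc(ranked_genes, known_genes):
    """AUC of a best-first ranking against a set of known cancer genes.

    Computed as the fraction of (known, other) gene pairs in which the
    known gene is ranked ahead of the other gene.
    """
    known = set(known_genes)
    n_pos = sum(1 for g in ranked_genes if g in known)
    n_neg = len(ranked_genes) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one known and one other gene")
    negatives_seen = 0
    inversions = 0  # pairs in which the other gene outranks the known gene
    for g in ranked_genes:
        if g in known:
            inversions += negatives_seen
        else:
            negatives_seen += 1
    return 1.0 - inversions / (n_pos * n_neg)
```

A random ordering gives an expected AUC of 0.5, which is the reference level for a random gene list.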

For all methods, GE and CN probes were 
matched by selecting for each GE probe the closest 
CN probe within the same chromosomal arm. 
One-to-one matching between the GE and CN 
data was required in the real case studies [34, 37]; 
in simulation experiments, the original simulation 
procedures [19, 29] were followed as described 
below. The preprocessing of CN data depends par- 
tially on the platform resolution. On the latest 
high-density SNP arrays, for instance, segmentation 
strategies are essential for estimating the CN for
individual genes [8]. Various approaches investigate
only certain genomic regions at a time,
e.g. to avoid bias, and propose different strategies to
select the size of the chromosomal region, including
fixed windows in terms of consecutive probes or base
pairs [28, 30, 34], chromosome arms or minimal
common regions [26] or performing kernel regression
[19], where the probe signals are modeled with a
smoothing function which accounts for the
nonuniform distribution of the genes along the
genome.

Table 1: Summary of the comparison algorithms

| Implementation | CN preprocessing | Methodology | Significance scoring | Reference |
|---|---|---|---|---|
| CNAmet (R) | Called | Custom statistic; Two step | PPT; aberrant regions | [24], [36] |
| DR-Correlate/t-test (BC) | Raw/segmented | Two step | PPT; P-values | [21] |
| DR-Correlate (BC) | Raw/segmented | COR | PPT; P-values | [21] |
| edira (R) | Raw/segmented | Custom statistic; COR | NT; P-values | [29] |
| intCNGEan (R) | cghCall object | Custom statistic; Two step | PNT; P-values | [20] |
| Ortiz-Estevez (R) | Raw/segmented | Two step | PNT; P-values | [16] |
| PMA (CRAN) | Raw/segmented | LV; COR | PLV; P-values | [56] |
| PREDA/SODEGIR (BC) | Raw/segmented | Custom statistic; Two step | PPT; aberrant regions/q-values | [19], [48] |
| pint/simcca | Raw/segmented | LV; COR | PLV; P-values | [34] |
| SIM (BC) | Raw/segmented | REG | PT; P-values | [26] |

The implementations are available through Bioconductor (BC), CRAN or R source code (R). The CN preprocessing methods required by each
algorithm are listed. COR, correlation analysis; REG, regression analysis; LV, latent variables analysis; PT, parametric test; NT, nonparametric test;
PNT, permutation test based on statistic of nonparametric test; PPT, permutation test based on statistic of parametric test; PLV, permutation
test based on latent variable score.
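
The probe matching rule used in the pipeline, where each GE probe is paired with the closest CN probe on the same chromosomal arm, can be sketched as follows. The tuple layout and function name are hypothetical, and the sketch does not enforce the one-to-one matching required in the real case studies.

```python
from bisect import bisect_left


def match_probes(ge_probes, cn_probes):
    """Pair each GE probe with the nearest CN probe on the same arm.

    ge_probes, cn_probes -- lists of (probe_id, chromosome_arm, position)
    Returns a dict ge_probe_id -> cn_probe_id, or None when the arm
    carries no CN probe.
    """
    by_arm = {}
    for cn_id, arm, pos in cn_probes:
        by_arm.setdefault(arm, []).append((pos, cn_id))
    for entries in by_arm.values():
        entries.sort()

    matches = {}
    for ge_id, arm, pos in ge_probes:
        entries = by_arm.get(arm, [])
        if not entries:
            matches[ge_id] = None
            continue
        # Index of the first CN probe located at or after the GE probe.
        i = bisect_left(entries, (pos,))
        # Only the neighbours on either side can be the closest probe.
        candidates = entries[max(i - 1, 0):i + 1]
        matches[ge_id] = min(candidates, key=lambda e: abs(e[0] - pos))[1]
    return matches
```

Sorting the CN probes once per arm and using binary search keeps the matching fast even for high-density arrays.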

Simulated data 

Two simulated data sets were generated by roughly 
following Schafer et al. ([29]; 'Schafer' data) and 
Bicciato et al. ([19]; 'Ferrari' data). The simulations 
are based on general assumptions regarding the asso- 
ciations between the (altered) CN and GE signals in 
genome-wide profiling studies, as detailed in the ori- 
ginal publications. For the 'Schafer' data set, CN and 
GE values are drawn from a normal mixture where 
two components represent aberrations of different 
extent for each locus; 100 samples were created for 
each input with mixing proportions of either 10% or 
90% for the affected and normal regions. Varying 
noise levels were imposed using multiple variance 
parameters (0.25, 0.5, 1, 2 and 4 times an adjusted 
median absolute deviation of the data). The data 
points are organized in 16 equally sized blocks to 
mimic affected regions. The 'Ferrari' data with six 
samples was created by manipulating a renal cell car- 
cinoma data set through permutation of loci and 
adding or subtracting constants to both CN and
GE values within 10 blocks of 10 Mbp. Normal
control data were generated by subtracting the
median across the samples [19].
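
A loose illustration of block-structured simulated data is given below. It does not reproduce either original simulation procedure: the shift size, noise level, block size, choice of affected blocks and the decision to let CN and GE share the same carrier samples are all assumptions made to show the general idea of paired CN/GE signals organized in blocks.

```python
import random


def simulate_block_data(n_samples=100, n_blocks=16, block_size=10,
                        affected_blocks=(0, 5, 10), mix=0.9,
                        shift=1.0, noise_sd=0.5, seed=0):
    """Generate toy CN and GE matrices with block-wise aberrations.

    Returns (cn, ge, is_affected), where cn and ge are lists with one
    vector of n_samples values per locus and is_affected flags the loci
    that lie in an affected block.
    """
    rng = random.Random(seed)
    cn, ge, is_affected = [], [], []
    for block in range(n_blocks):
        affected = block in affected_blocks
        # Samples carrying the aberration in this block; a fraction of
        # roughly `mix` of the samples in affected blocks, none elsewhere.
        carriers = [affected and rng.random() < mix
                    for _ in range(n_samples)]
        for _ in range(block_size):
            cn.append([rng.gauss(shift if c else 0.0, noise_sd)
                       for c in carriers])
            ge.append([rng.gauss(shift if c else 0.0, noise_sd)
                       for c in carriers])
            is_affected.append(affected)
    return cn, ge, is_affected
```

The flags returned in is_affected play the role of the ground truth against which prioritized lists are scored.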

Real case studies 

We investigated two publicly available breast cancer 
data sets [12, 13] and a leukemia study [38]. 
Expert-curated lists of known breast cancer genes 
[39] and leukemia genes from the Cancer Gene 
Census [40] were used as the ground truth for the 
benchmarking experiments, respectively. The pre- 
processed 'Hyman' data set [13] contains 14 breast 
cancer cell lines, 7489 genes and 48 known breast 
cancer genes. The preprocessed 'Pollack' data set [12] 
contains 41 breast cancer samples, 4287 genes and 38 
known breast cancer genes. The preprocessed 
'Mullighan' data set consists of 171 acute lympho- 
blastic leukemia (ALL) samples divided into 9 sub- 
types [38, 41], 2162 genes in the matched CN/GE 
data and 39 known leukemia genes. A combination 
of standard algorithms was used to preprocess the 
500 K Affymetrix CN data [42-44] and the 
Affymetrix GE data [45-47] for the Mullighan data 
set. The CN data (Affymetrix Human Mapping 
500 K) was downloaded from ftp://ftp.studje.org 
and normalized with CRMA v2 [42]. The
log-additive model from the CRMA v1 algorithm
[43] was used for probe summarization. Data values
from the Nsp and Sty arrays of the 500 K set were
combined and segmented with CBS [44].
GE profiles of the same ALL specimens, measured
with the Affymetrix HG-U133A platform, were obtained
from GEO (GSE12995; [45]) and preprocessed
with the RPA algorithm [46] and an
EntrezID-based custom chip definition file (v13;
[47]). The reference for GE and CN data was
defined as the median normalized log ratios across
all samples. In all data sets, probes with no
EntrezID or location information and probes mapping
to multiple locations or in sex chromosomes
were excluded. Missing values were imputed by
Gaussian random samples using the mean and variance
of the data.
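
The two generic preprocessing steps mentioned above, centering on a median reference and Gaussian imputation of missing values, can be sketched as follows. The function names are hypothetical, and applying both steps per probe across samples is an assumption; the text does not specify whether the mean and variance were computed per probe or over the whole data set.

```python
import random
from statistics import mean, median, stdev


def center_on_median(values):
    """Express one probe's log ratios relative to their median.

    Missing values (None) are ignored for the median and left missing.
    """
    reference = median(v for v in values if v is not None)
    return [None if v is None else v - reference for v in values]


def impute_gaussian(values, seed=0):
    """Replace missing values (None) by Gaussian random draws.

    The draws use the mean and standard deviation of the observed
    values, so at least two observed values are required.
    """
    observed = [v for v in values if v is not None]
    mu, sd = mean(observed), stdev(observed)
    rng = random.Random(seed)
    return [rng.gauss(mu, sd) if v is None else v for v in values]
```

Random imputation avoids creating artificial ties or spikes at the mean, at the cost of making results depend on the random seed.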



RESULTS 

The cancer gene prioritization performance of the 
comparison methods as quantified by the AUC analysis
is summarized in Figure 1 (for the ROC curves,
see Supplementary Figure S1). The highest median
ranking across the five benchmarking data sets was 
obtained by edira (1), followed by Ortiz-Estevez (4) 
and pint/simcca (4). Each of these three methods 
outperformed the others on at least one data set. 
Note that the performance of edira with the 
'Schafer' data set and of PREDA/SODEGIR with 
the 'Ferrari' data set needs to be carefully interpreted, 
since these simulations were originally constructed to 
follow the particular modeling assumptions of these 
algorithms in the original publications [19, 29]. The
complete benchmarking results are available at the
project website.

Considering the true-positive rate among the 
top 200 genes of each algorithm, pint/simcca had 
the highest median ranking (1), followed by 
edira, Ortiz-Estevez and PREDA/SODEGIR (3; 
Supplementary Figure S2). These methods had sys- 
tematically the highest median rankings with mul- 
tiple thresholds (20, 50 and 100 top genes). Notably, 
although edira and PREDA/SODEGIR had the 
highest AUC scores on the Schafer data, most
other algorithms outperformed these methods with
respect to known true positives among the top find- 
ings in this data set. 
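
The analysis of the top findings can be sketched as below. The names are hypothetical, and the rate is defined here as the fraction of known cancer genes present in the ranking that appear among the first k genes, which is one plausible reading of the true-positive rate used above.

```python
def known_genes_in_top_k(ranked_genes, known_genes, k=200):
    """Known cancer genes found among the first k genes of a ranking."""
    known = set(known_genes)
    return [g for g in ranked_genes[:k] if g in known]


def true_positive_rate_at_k(ranked_genes, known_genes, k=200):
    """Fraction of the known genes in the ranking recovered in the top k."""
    known = set(known_genes) & set(ranked_genes)
    if not known:
        raise ValueError("no known genes are present in the ranking")
    return len(known_genes_in_top_k(ranked_genes, known, k)) / len(known)
```

Evaluating several thresholds, such as 20, 50, 100 and 200, shows whether a method concentrates the known genes at the very top or only performs well over the whole list.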

Differences regarding the running times were considerable
(Supplementary Table S1). Specifically,
edira and PMA were the fastest methods with less
than 1 min running time in all data sets, closely followed
by Ortiz-Estevez with a maximum running
time of <3 min. The number of permutations in
significance testing markedly affects the running times
of CNAmet, DR-Correlate, intCNGEan and
PREDA/SODEGIR, although a parallelized version
has been implemented in the latest version of
PREDA/SODEGIR to reduce computation time [48].

Figure 1: AUC values in ROC analysis quantify cancer gene prioritization performance of the methods for the five
benchmarking data sets. High values indicate high true-positive versus false-positive ratio among the top findings;
the dashed line indicates the expected AUC value for a random gene list (AUC = 0.5). The methods have been
ordered by their median rank across all data sets. For the ROC curves, see Supplementary Figure S1.

DISCUSSION

Prioritization of disease genes is a key modeling task
in functional genomics [49-52]. This review provides
an overview and quantitative benchmarking
of publicly available algorithms for detecting
associations between GE and CN alterations. Our work
complements the recent review by Huang et al. [8],
who pointed out the lack of quantitative comparisons
of the available methods. The 'intcomp' benchmarking
package applied in this review is freely
available at R-forge
(http://intcomp.r-forge.r-project.org) to facilitate
transparent comparisons
and the addition of new algorithms, benchmarking
procedures and validation data sets.

The comparison of 12 algorithms with respect to 
their cancer gene prioritization performance revealed 
systematic differences across independent data 
sets, preprocessing scenarios and sample sizes. 
Interestingly, while no systematic differences be- 
tween the four main categories of GE/CN integra- 
tion approaches were seen, systematic differences 
between individual methods were evident. In par- 
ticular, edira, Ortiz-Estevez and pint/simcca consist- 
ently outperformed the other methods. Considering 
both relative performance and running time, edira 
and Ortiz-Estevez seem to offer an optimal trade-off, 
although all methods have acceptable running times
for practical applications. While none of the methods
outperformed the others in all data sets, identification 
of the few best-performing implementations pro- 
vides quantitative guidance for the selection of ana- 
lysis tools and has therefore direct practical relevance 
for cancer studies. 

Benchmarking the algorithms on real data is cru- 
cial since simulation studies are unlikely to capture all 
complexities present in real data. However, the avail- 
ability of suitable benchmarking data sets is limited. 
We selected publicly available data sets in which 
both GE and CN data from the same samples are 
available and independent lists of known cancer 
genes obtained from the literature. The model per- 
formance is in general better in the simulation stu- 
dies, compared to the real cancer data sets, suggesting 
that manually curated cancer gene lists may be only 
coarse approximations of the ground truth in the real 
case studies and that simulations may have lower 
noise levels. On the other hand, simulation procedures
are only rough approximations of the biological
reality and the simulation schema can markedly
affect model performance. For instance, variants of
DR-Correlate and CNAmet performed well with
'Schafer' simulated data, but their performance
dropped close to random expectation in the
'Ferrari' data set. The 'Ferrari' simulations assume
that the CN effect is visible in all tumor samples,
which can be particularly disadvantageous for
DR-Correlate and other methods that rely on
variations between the aberration profiles across the
samples. The 'Ferrari' and 'Schafer' simulated data sets
were originally designed to evaluate the perform- 
ances of PREDA/SODEGIR and edira methods, 
and this aspect potentially causes positive bias 
on these methods in the respective data sets. 
Moreover, certain methods, such as CNAmet [36], 
Ortiz-Estevez [16] or PREDA/SODEGIR [19], 
have originally been designed to prioritize altered 
chromosomal regions rather than individual genes. 
Our benchmarking procedure is based on the priori- 
tization of individual genes since this is the most 
prevalent objective shared by the available GE/CN 
integration algorithms. 

Since chromosomal CN alterations represent a key 
feature of cancer, well-performing GE/CN analysis 
methods are expected to have a good prioritization 
performance of known cancer genes. However, cer- 
tain cancer genes may be overlooked by integrative 
approaches that focus only on simultaneous changes 
in both GE and CN levels since gene activity is also 
affected by cellular mechanisms other than GE/CN 
alterations. For this reason, it was not unexpected
that 33-73% of the known cancer genes were not
included among the first 200 prioritized genes by any
comparison method in the five benchmarking data
sets. The relatively low number (0-8) of the known
cancer genes among the first 200 findings in the real
case studies highlights the need for efficient
approaches to identify key mutations and genes
that drive cancer development and progression
[23]. Moreover, although every algorithm detected
some cancer genes, none of the known cancer
genes was detected by all methods in any
benchmarking data set among the first 200 findings.
Since different methods emphasize different aspects 
of the GE/CN data, efficient joint analysis of the 
results from multiple independent methodologies 
might outperform individual methods. One could, 
for instance, consider mean or median ranks across 
the prioritized lists, or weight the different lists 
according to certain criteria. Related approaches 
have been suggested elsewhere [49], but have not
yet been investigated in the context of GE/CN analysis.
In our experiments, straightforward ranking of
the genes based on their mean or median rank across 
the different methods did not outperform the 
best-performing methods in any benchmarking 
data set. 
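
The straightforward rank aggregation mentioned above can be sketched as follows. The function name and the handling of incomplete lists are assumptions: genes that are not ranked by every method are dropped, and ties in the median rank are broken alphabetically to keep the output deterministic.

```python
from statistics import median


def aggregate_by_median_rank(rankings):
    """Combine several best-first gene rankings by median rank.

    rankings -- list of gene lists, each ordered from best to worst
    Returns the genes present in every list, ordered by median rank.
    """
    positions = {}
    for ranking in rankings:
        for rank, gene in enumerate(ranking, start=1):
            positions.setdefault(gene, []).append(rank)
    # Keep only genes that every method has ranked.
    complete = {g: r for g, r in positions.items()
                if len(r) == len(rankings)}
    return sorted(complete, key=lambda g: (median(complete[g]), g))
```

Replacing median by mean gives the mean-rank variant; weighted schemes would additionally require a per-method weight, which is not covered by this sketch.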

The choice of preprocessing and model parameters
can have a remarkable effect on the results.
The key decisions in the context of GE/CN data
are associated with selecting the CN preprocessing
approach [53], size of the investigated chromosomal
regions and the matching approach for the integrated
data sets. These and related issues are extensively
discussed in the recent review by Huang
et al. [8]. It is also possible to utilize class information
of the samples, for instance, by including both
tumor and reference samples [21]. However, in 
many cases, the references are included as a 
pooled control for two-color microarray experi- 
ments but not as a separate group, as with the 
Hyman and Pollack data sets. Moreover, genomic 
aberrations often affect only a subset of the cancer 
patients, and multiple cancer subtypes may be pre- 
sent, as in the Mullighan data set. The matching 
approach for GE/CN data may also affect the re- 
sults. In the current pipeline, each GE probe is 
matched to the closest CN probe or segment. 
Requiring one-to-one matching of the GE/CN 
data may lead to exclusion of many GE probes 
in particular on high-density arrays such as in the 
Mullighan data set. The publicly available bench- 
marking pipeline will allow further experimenta- 
tion with alternative preprocessing scenarios. All 
data presented in this study come from microarray 
studies, where several matched GE/CN data sets 
are available from public sources, but the approach 
should be in principle applicable also to 
high-throughput sequencing data. Since the under- 
lying biological phenomena remain unaltered, and 
methodological approaches proposed for GE/CN 
integration are based on relatively general modeling 
assumptions, it can be expected that the proposed 
methods are applicable also in the context of 
next-generation sequencing after appropriate data 
preprocessing. 

Further integrative tasks in GE/CN analysis would 
include modeling of trans-regulatory effects of CN 
aberrations on genes outside the affected region 
[54, 55], disease subtype discovery [4], prediction 
of patient survival or of clinical covariates [56] and 
integrative analysis of other data sources, such as 
methylation [57], microRNA [58, 59] or protein
expression [60]. However, fewer implementations 
for such tasks are currently available. Availability of 
reference implementations would facilitate bench- 
marking and optimizing new algorithms. The 
benchmarking pipeline introduced in this review
can be adjusted to incorporate additional algorithms
and data sets as they become available.

CONCLUSION 

A variety of methods is available for the integrative 
analysis of GE and CN data. The algorithms can be 
classified as two-step, regression, correlation-based 
and latent variable approaches. Implementation qual- 
ity, running time and accuracy of the algorithm, as 
well as preprocessing, sample size and availability of 
control samples need to be considered when select- 
ing the appropriate method. The benchmarking 
pipeline reveals systematic differences in cancer 
gene prioritization performance of available imple- 
mentations across five case studies. 

SUPPLEMENTARY DATA 

Supplementary Data are available online at 
http://bib.oxfordjournals.org/.

Key Points

• Integrative analysis algorithms for GE and CN data include 
two-step, regression, correlation-based and latent variable 
approaches. 

• The benchmarking pipeline reveals systematic differences in 
cancer gene prioritization performance of currently available 
implementations. 

• Implementation quality, running time and accuracy of the algo- 
rithm, as well as data preprocessing, sample size and availability 
of control samples need to be considered when selecting the 
analysis approach.

Training Group Statistical Modeling). S.B. is supported
by the AIRC Special Program Molecular
Clinical Oncology '5 per mille'.